Search CORE

1,225 research outputs found

Eucalyptus research in the post-genome era

Author: D Grattapaglia
Dario Grattapaglia
DR Zerbino
Georgios Pappas
Marilia CR Pappas
Orzenil B Silva-Junior
Roberto C Togawa
Sergio de Alencar
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Crossref

Springer - Publisher Connector

PubMed Central

Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Author: A Edwards
A Edwards
B Ewing
D Hernandez
DA Wheeler
Daniel S. Rokhsar
DR Bentley
DR Bentley
DR Smith
DR Zerbino
DR Zerbino
ES Lander
EW Myers
EW Myers
EW Myers
Gary P. Schroth
GG Sutton
I Maccallum
Isaac Ho
J Butler
Jarrod A. Chapman
JC Roach
JL Weber
JT Simpson
K Hayashi
M Chaisson
M Margulies
M Pop
M Pop
MJ Chaisson
MJ Chaisson
ML Metzker
P Flicek
PA Pevzner
R Li
R Li
RL Warren
RM Idury
SC Schuster
SF Altschul
Shujun Luo
Sirisha Sunkara
Steven L. Salzberg
TW Jeffries
TW Jeffries
Publication venue: Public Library of Science
Publication date: 01/08/2011
Field of study

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

UNT Digital Library

Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs

Author: BG Jackson
DR Zerbino
EW Myers
Hieu Dinh
JD Kececioglu
JT Simpson
Matthew Vaughn
P Medvedev
PA Pevzner
S Batzoglou
Sanguthevar Rajasekaran
Vamsi K Kundeti
Vishal Thapar
X Huang
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories - based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an <it>O</it>(<it>n/p</it>) time parallel algorithm has been given for this problem. Here <it>n </it>is the size of the input and <it>p </it>is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(<it>n</it>Σ) messages (Σ being the size of the alphabet). Results In this paper we present a Θ(<it>n/p</it>) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of <inline-formula><m:math name="1471-2105-11-560-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow><m:mo>Θ</m:mo><m:mo stretchy="false">(</m:mo><m:mfrac><m:mrow><m:mi>n</m:mi><m:mi>log</m:mi><m:mo stretchy="false">(</m:mo><m:mi>n</m:mi><m:mo>/</m:mo><m:mi>B</m:mi><m:mo stretchy="false">)</m:mo></m:mrow><m:mrow><m:mi>B</m:mi><m:mi>log</m:mi><m:mo stretchy="false">(</m:mo><m:mi>M</m:mi><m:mo>/</m:mo><m:mi>B</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:mfrac><m:mo stretchy="false">)</m:mo></m:mrow></m:math></inline-formula> (<it>M </it>being the main memory size and <it>B </it>being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster - both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem. Conclusions The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on Eulerian approach. Our algorithms for constructing Bi-directed de Bruijn graphs are efficient in parallel and out of core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.</p

CiteSeerX

Crossref

Cold Spring Harbor Laboratory Institutional Repository

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Texas ScholarWorks

Evolutionary Sequence Analysis and Visualization with Wasabi

Author: A Löytynoja
A Löytynoja
A Veidenberg
A Yates
AJ Vilella
B Paten
BR Baum
DR Maddison
DR Zerbino
J Huerta-Cepas
J Zhang
K Katoh
MA Larkin
MN Price
MV Han
S Kumar
YS Cho
Z Yang
Z Yang
Publication venue: Humana press
Publication date: 01/01/2021
Field of study

Wasabi is an open-source, web-based graphical environment for evolutionary sequence analysis and visualization, designed to work with multiple sequence alignments within their phylogenetic context. Its interactive user interface provides convenient access to external data sources and computational tools and is easily extendable with custom tools and pipelines using a plugin system. Wasabi stores intermediate editing and analysis steps as workflow histories and provides direct-access web links to datasets, allowing for reproducible, collaborative research, and easy dissemination of the results. In addition to shared analyses and installation-free usage, the web-based design allows Wasabi to be run as a cross-platform, stand-alone application and makes its integration to other web services straightforward. This chapter gives a detailed description and guidelines for the use of Wasabi's analysis environment. Example use cases will give step-by-step instructions for practical application of the public Wasabi, from quick data visualization to branched analysis pipelines and publishing of results. We end with a brief discussion of advanced usage of Wasabi, including command-line communication, interface extension, offline usage, and integration to local and public web services.Peer reviewe

Crossref

Helsingin yliopiston digitaalinen arkisto

Identifying the favored mutation in a positive selective sweep.

Author: A Ferrer-Admetlla
Ali Akbari
Arya Iranmehr
BF Voight
C Heffelfinger
CD Campbell
DR Schrider
DR Zerbino
G Coop
G Ewing
H Chen
J Ohashi
JJ Vitti
Joseph J Vitti
KJ Galinsky
M DeGiorgio
M Pybus
M Wang
MC Cornelis
MD Shriver
Mehrdad Bakhtiari
MI Jensen-Seaman
MW Nachman
NR Garud
P Azad
P Pavlidis
Pardis C Sabeti
PC Sabeti
PC Sabeti
PC Sabeti
R Nielsen
R Ronen
R Ronen
S Beleza
S Fan
S Gravel
S Wilde
SA Tishkoff
Siavash Mirarab
SR Grossman
T Stobdan
Vineet Bafna
Y Field
Y Kim
ZA Szpiech
Publication venue: eScholarship, University of California
Publication date: 01/04/2018
Field of study

Most approaches that capture signatures of selective sweeps in population genomics data do not identify the specific mutation favored by selection. We present iSAFE (for "integrated selection of allele favored by evolution"), a method that enables researchers to accurately pinpoint the favored mutation in a large region (∼5 Mbp) by using a statistic derived solely from population genetics signals. iSAFE does not require knowledge of demography, the phenotype under selection, or functional annotations of mutations

Crossref

eScholarship - University of California

QSRA – a quality-value guided de novo short read assembler

Author: D Hernandez
Douglas W Bryant
DR Zerbino
J Butler
J Dohm
J Kent
MJ Chaisson
NG de Bruijn
R Cronn
R Warren
Todd C Mockler
W Jeck
Weng-Keen Wong
Publication venue: BioMed Central
Publication date: 01/01/2009
Field of study

Abstract Background New rapid high-throughput sequencing technologies have sparked the creation of a new class of assembler. Since all high-throughput sequencing platforms incorporate errors in their output, short-read assemblers must be designed to account for this error while utilizing all available data. Results We have designed and implemented an assembler, Quality-value guided Short Read Assembler, created to take advantage of quality-value scores as a further method of dealing with error. Compared to previous published algorithms, our assembler shows significant improvements not only in speed but also in output quality. Conclusion QSRA generally produced the highest genomic coverage, while being faster than VCAKE. QSRA is extremely competitive in its longest contig and N50/N80 contig lengths, producing results of similar quality to those of EDENA and VELVET. QSRA provides a step closer to the goal of de novo assembly of complex genomes, improving upon the original VCAKE algorithm by not only drastically reducing runtimes but also increasing the viability of the assembly algorithm through further error handling capabilities.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A robust SNP barcode for typing Mycobacterium tuberculosis complex strains

Author: A McKenna
A Stamatakis
B Langmead
B López
B Tessema
CB Ford
D Stucki
DR Zerbino
E Abadia
F Coll
F Coll
G Thwaites
H Li
H Zhang
I Comas
I Filliol
J Felsenstein
J Wang
JH Bates
M Caws
M Kato-Maeda
M Kato-Maeda
N Casali
P Nahid
PF Barnes
PL Lin
R Firdessa
S Feuerriegel
S Gagneux
S Homolka
T Derrien
T Weniger
TA Rado
Y Blouin
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014
Field of study

Strain-specific genomic diversity in the Mycobacterium tuberculosis complex (MTBC) is an important factor in pathogenesis that may affect virulence, transmissibility, host response and emergence of drug resistance. Several systems have been proposed to classify MTBC strains into distinct lineages and families. Here, we investigate single-nucleotide polymorphisms (SNPs) as robust (stable) markers of genetic variation for phylogenetic analysis. We identify ~92k SNP across a global collection of 1,601 genomes. The SNP-based phylogeny is consistent with the gold-standard regions of difference (RD) classification system. Of the ~7k strain-specific SNPs identified, 62 markers are proposed to discriminate known circulating strains. This SNP-based barcode is the first to cover all main lineages, and classifies a greater number of sublineages than current alternatives. It may be used to classify clinical isolates to evaluate tools to control the disease, including therapeutics and vaccines whose effectiveness may vary by strain type

CiteSeerX

Crossref

LSHTM Research Online

Repositório da Universidade Nova de Lisboa

PubMed Central

Birkbeck Institutional Research Online

LSHTM Data Compass

Safe and complete contig assembly via omnitigs

Author: A Bankevich
A Guénoche
AR Rubinov
AS Motahari
C Kingsford
D Haussler
DR Zerbino
E Kapun
E Kapun
ES Lander
G Bresler
G Narzisi
I Lysov
JD Kececioglu
JR Miller
JT Simpson
JT Simpson
K Lam
K Sahlin
L Salmela
M Boetzer
M Boetzer
N Nagarajan
N Nagarajan
N Vyahhi
P Medvedev
P Medvedev
P Medvedev
PA Pevzner
PA Pevzner
R Chikhi
R Chikhi
R Luo
R Uricaru
RM Idury
SL Salzberg
Publication venue
Publication date: 16/08/2016
Field of study

Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph

G

(e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported from

G

as contigs? In this paper we finally answer this question, and also give a polynomial time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs.Comment: Full version of the paper in the proceedings of RECOMB 201

arXiv.org e-Print Archive

Crossref

SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data

Author: A Martínez-Alcántara
Daniel A Peterson
DR Zerbino
GJ Hannon
J Rougemont
JC Dohm
ML Metzker
Murray P Cox
P Pavlidis
Patrick J Biggs
PC Dolan
PJA Cock
R Development Core Team
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background Illumina's second-generation sequencing platform is playing an increasingly prominent role in modern DNA and RNA sequencing efforts. However, rapid, simple, standardized and independent measures of run quality are currently lacking, as are tools to process sequences for use in downstream applications based on read-level quality data. Results We present SolexaQA, a user-friendly software package designed to generate detailed statistics and at-a-glance graphics of sequence data quality both quickly and in an automated fashion. This package contains associated software to trim sequences dynamically using the quality scores of bases within individual reads. Conclusion The SolexaQA package produces standardized outputs within minutes, thus facilitating ready comparison between flow cell lanes and machine runs, as well as providing immediate diagnostic information to guide the manipulation of sequence data for downstream analyses.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Arapan-S: a fast and highly accurate whole-genome assembly software for viruses and small genomes

Author: B Chevreux
DD Sommer
DR Zerbino
DR Zerbino
DW Bryant
EW Myers
GG Sutton
I Maccallum
J Butler
JT Simpson
MJ Chaisson
Mohammed Sahli
P Medvedev
PA Pevzner
R Li
RL Warren
Tetsuo Shibuya
X Huang
X Huang
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref